A List of Candidate Cancer Biomarkers for Targeted Proteomics

We have compiled from literature and other sources a list of 1261 proteins believed to be differentially expressed in human cancer. These proteins, only some of which have been detected in plasma to date, represent a population of candidate plasma biomarkers that could be useful in early cancer detection and monitoring given sufficiently sensitive specific assays. We have begun to prioritize these markers for future validation by frequency of literature citations, both total and as a function of time. The candidates include proteins involved in oncogenesis, angiogenesis, development, differentiation, proliferation, apoptosis, hematopoiesis, immune and hormonal responses, cell signaling, nucleotide function, hydrolysis, cellular homing, cell cycle and structure, the acute phase response and hormonal control. Many have been detected in studies of tissue or nuclear components; nevertheless we hypothesize that most if not all should be present in plasma at some level. Of the 1261 candidates only 9 have been approved as “tumor associated antigens” by the FDA. We propose that systematic collection and large-scale validation of candidate biomarkers would fill the gap currently existing between basic research and clinical use of advanced diagnostics.


Introduction
The study of cancer biomarker proteins began in 1847 with the discovery by Henry Bence-Jones of what turned out, more than 100 years later, to be a tumor-produced free antibody light chain ("Bence Jones protein") in the urine of a multiple myeloma patient (Bence-Jones 1847; Kyle 1994), where it was present in large quantities and could be revealed by simple heat denaturation. Some 140 years after the original discovery, this protein was demonstrated to be present also in the serum (Sinclair et al. 1986), and in 1998 a routine immunodiagnostic test was approved by the FDA. Hormones produced by tumors were also detected early on (Chan and Sell 1999): adrenocorticotropic hormone (ACTH), calcitonin, and chorionic gonadotropin (hCG), for example, are elevated in specific cancer types, though not with the tumor specificity of Bence-Jones proteins.
Unfortunately, the paradigm in which an overproduced tumor-specific protein can be easily detected as a marker of cancer has turned out to be the exception rather than the rule: in the nearly 160 years since Bence-Jones' discovery, fewer than 10 proteins have progressed to the level of FDA-approved cancer diagnostic tests, and most of these lack ideal sensitivity and specificity for cancer.
In recent years "… the emerging science of genomics and proteomics have generated a plethora of candidate cancer biomarkers" (Pritzker 2002). Unfortunately, few of these markers immediately stand out as superior prognostic or diagnostic tools, and even fewer have been validated and approved. Several factors might account for the slow pace of advance in cancer biomarkers. On the one hand, available proteomics technology has limited power to detect low-abundance cancer biomarkers against the background of high-abundance plasma proteins, and many of the best markers may thus be missed until discovery technology improves. On the other hand, the capacity to verify and validate existing candidate markers (through rigorous testing in large sample sets from many diseases) is limited, and it is therefore possible that the required biomarkers have already been "discovered" but not yet validated. In this paper, we are concerned with the latter possibility, and specifically with the problem of selecting among the existing candidates those that are most promising for systematic validation.
This line of enquiry immediately raises the question: where is the list of known candidate cancer biomarkers? While a number of useful reviews and books discuss specific cancer markers with clinical promise, these generally concentrate on proven, or at least well-developed, markers or specific disease states. We were unable to find a list that draws together a large population of candidates at all stages of development from multiple discovery sources, and thus our first step has been to create one through a combination of literature search and other methods.
The value of a list of existing candidates could be limited by the general lack of sensitivity and specificity exhibited by most of the cancer markers found to date, a factor that may have discouraged others from undertaking this task previously. Most candidates that have been followed up in larger studies have shown poor diagnostic value (Table 1), and even those that have been approved for clinical use exhibit lower sensitivity and specificity than the well-known markers of, eg, acute cardiovascular events (ie, troponin in myocardial infarction or B-type natriuretic peptide in congestive heart failure, Table 1).
On the other hand, there seems to be a growing consensus that panels of markers may be able to supply the specificity and sensitivity that individual markers lack. For example, a panel combining four known biomarkers (leptin, prolactin, osteopontin, insulin-like growth factor II), none of which used alone could distinguish patients from the controls, achieved a sensitivity and specificity of 95% for the diagnosis of ovarian cancer (Mor et al. 2005). In this case a combination of known proteins in a novel panel provided a significant advance. Xiao et al. identified 299 proteins in tissue culture by 1-D PAGE and nano-ESI-MS/MS but then used ELISA to test 13 of the most interesting in serum. They reported that CD98, fascin, the secreted chain of the polymeric immunoglobulin receptor and 14-3-3 eta provide greater sensitivity when used together as a panel than any of the markers used alone (Xiao et al. 2005). If, as we and others (Conrads et al. 2003) believe, panels of proteins provide the most promising avenue towards early and accurate cancer detection, then a re-examination of known candidates provides a logical approach to panel generation, with the expectation that a stream of new markers can be added as they are identified by marker discovery studies. This candidate-based, or targeted, approach will require a comprehensive list of prioritized candidates coupled with a technology able to assay these in large sets of plasma and serum samples from clinical and epidemiological studies (together a "biomarker pipeline" (Anderson 2005b)).
Here we have begun to compile and prioritize a database of candidate biomarkers reported to be differentially expressed in studies of human cancer. We have included changes observed either at the protein (plasma or tissue) or nucleic acid (tissue DNA or RNA) level for any cancer, and excluded results restricted to animal, cell culture systems, or single case report studies in hopes of focusing on the most promising clinical biomarker candidates. We hypothesize that the protein version of most, if not all of these markers should be detectable in blood plasma at some level, irrespective of the tissue source, ultimately allowing for their use in patient screening, diagnosis or follow-up.

Search strategy
The principal strategy for creation of our list involved compilation of designated cancer-related proteins from our previously published work (Anderson et al. 2004), PubMed literature searches, cancer microarrays (868 proteins from 111 human cancer Superarrays; http://www.superarry.com and supplemental material), Circulating Tumor Markers of the New Millennium (Wu 2002), American Association for Clinical Chemistry abstracts and general literature perusal. PubMed searches included de novo literature searches: [plasma (Title/Abstract) NOT membrane (Title/Abstract) NOT stimulation (Title/Abstract) NOT drug (Title/Abstract) NOT dose (Title/Abstract) AND protein (Title/Abstract) AND cancer] and ["cancer antigen" AND human], as well as the search used for proteins from other sources ["protein name" AND cancer AND human AND (where necessary) diagnostic AND (where necessary) expression] and PubMed "related article" searches. Only proteins for which we found at least one published study on human cancer utilizing primary samples were retained (639 of the array proteins). Each biomarker reference was then manually tabulated and curated as to disease and tissue (including plasma). Single case studies were excluded.
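The bracketed search strings above can be assembled programmatically. The following is a minimal sketch, not the tooling used in this work: the function names are hypothetical, and it assumes PubMed's square-bracket field-tag syntax (term[Title/Abstract]) in place of the parenthesized form printed above.

```python
from typing import List


def tiab(term: str) -> str:
    """Restrict a term to the title/abstract field (PubMed field-tag syntax)."""
    return f"{term}[Title/Abstract]"


def de_novo_plasma_query() -> str:
    """De novo plasma search: plasma and protein in title/abstract, four terms excluded."""
    excluded = ["membrane", "stimulation", "drug", "dose"]
    parts = [tiab("plasma")] + [f"NOT {tiab(term)}" for term in excluded]
    parts += [f"AND {tiab('protein')}", "AND cancer"]
    return " ".join(parts)


def protein_query(name: str, diagnostic: bool = False, expression: bool = False) -> str:
    """Per-protein search; the two optional terms are added only where necessary."""
    terms: List[str] = [f'"{name}"', "cancer", "human"]
    if diagnostic:
        terms.append("diagnostic")
    if expression:
        terms.append("expression")
    return " AND ".join(terms)
```

Keeping the optional terms as flags mirrors the "(where necessary)" wording: the same template serves both broad and narrowed searches.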

Citation analysis
Each documented protein on the resulting list was searched against the literature (via PubMed) using the query ["protein name" AND human AND cancer AND diagnostic]. This is admittedly a crude metric of research interest in a biomarker, but provides a useful method of relative prioritization among markers. In tabulating citation frequencies we did not exclude those categories ruled out in compiling the list initially: studies of animal systems, single clinical cases, or cell lines. If the "protein name" was not found by this search strategy it was counted as zero. It must be noted that PubMed is not a static archive but is constantly changing through additions, subtractions, and redefinition of MeSH headings. Still, this exercise allowed some relative ranking of interest and therefore importance. Total cancer citations per year were determined using the query [human AND cancer AND diagnostic] limited to a specific publication year.
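The counting rule described above (one query per protein, a missing result counted as zero, then relative ranking) can be sketched as follows. This is an illustration under stated assumptions: count_hits is a hypothetical callable standing in for whatever returns a PubMed hit count, and the marker names and counts in the example are invented.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def citation_query(name: str) -> str:
    """Query used for citation counting, as given in the text."""
    return f'"{name}" AND human AND cancer AND diagnostic'


def rank_by_citations(
    names: Iterable[str],
    count_hits: Callable[[str], Optional[int]],
) -> List[Tuple[str, int]]:
    """Rank proteins by hit count, highest first; a name with no result counts as zero."""
    counts: Dict[str, int] = {}
    for name in names:
        hits = count_hits(citation_query(name))
        counts[name] = 0 if hits is None else hits
    # Sort by descending count, then by name so that ties are reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Illustrative stand-in for a real lookup; these counts are made up.
_fake_index = {citation_query("marker A"): 120, citation_query("marker B"): 7}
ranking = rank_by_citations(["marker B", "marker C", "marker A"], _fake_index.get)
```

Because PubMed changes over time, any such ranking is a snapshot tied to the date the counts were retrieved.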

Annotation
Swiss-Prot/Uniprot accession numbers were obtained where possible. Most of the TrEMBL annotations were done prior to the addition of species information to the annotation number and so this form of the annotation was maintained. Candidate cancer biomarkers were annotated with GO numbers and IDs from EBI's human GOA 30.0 (gene_association.goa_human, ftp://ftp.ebi.ac.uk/pub/databases) and the Gene Ontology's GO.def version 1.213 (http://www.geneontology.org/ontology/GO.defs) respectively. Similar ID groupings were then combined. The entire Human GO file was treated in an identical fashion for comparison with the candidates.

Protein concentrations
Where possible, normal or control values for the plasma concentration of each protein were obtained by literature search. Unless specifically noted, protein concentrations are for the intact protein, not individual subunits.

Results
A search strategy combining literature search, extraction from microarray data, and a review of existing clinical tests, followed by manual curation, provided a list of 1261 candidate protein biomarkers (supplemental material) for which we found evidence of a quantitative change in some human cancer. As shown in Table 2, the candidates included proteins known to occur in plasma (274), proteins detected in tissue samples (542), and proteins whose corresponding mRNA or DNA levels were differentially expressed between cancerous and normal samples (656). These categories are non-exclusive in that a significant number of the candidates were found in more than one type of study. Proteins detected in the plasma represent 22% of the total proteins documented to date.

Citation frequency
Citation frequency analysis was used as one method of prioritizing the biomarkers, on the assumption that proteins most widely studied in the context of cancer had more promise as biomarkers. Citation frequency was determined using a PubMed query intended to count citations in which the authors considered the proteins to have diagnostic value (Figure 1, Table 3). By this measure, 29% of the 1,261 biomarkers have no such citations, 67% have fewer than 10, and 74% fewer than 20. Likewise, only a very limited number of biomarkers have extensive citations: 62 proteins, or only 5% of the total, were found to have greater than 500 citations.
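The variable-width binning described for Figure 1 (bins of 10, 100 and 1000 citations, each count divided by its bin size) can be sketched as below. This is one possible reading, not the original analysis code: the handling of the exact boundaries at 100 and 1000, and the placement of zero-citation proteins in the first bin, are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Bin = Tuple[int, int]  # inclusive (low, high) citation counts


def bin_width(n: int) -> int:
    """Bin size grows with citation count: 10, then 100, then 1000."""
    if n <= 100:
        return 10
    if n <= 1000:
        return 100
    return 1000


def bin_for(n: int) -> Bin:
    """Inclusive bin holding n, e.g. 15 -> (11, 20); zero is placed in (1, 10) by assumption."""
    width = bin_width(n)
    low = ((max(n, 1) - 1) // width) * width + 1
    return (low, low + width - 1)


def normalized_histogram(citations: Iterable[int]) -> Dict[Bin, float]:
    """Count proteins per bin, then divide each count by the width of its bin."""
    raw = Counter(bin_for(n) for n in citations)
    return {b: count / (b[1] - b[0] + 1) for b, count in sorted(raw.items())}
```

Dividing by bin width puts the sparse high-citation bins on the same per-citation scale as the dense low-citation bins, so the histogram can be read as a density.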

Biomarkers with greater than 500 citations
Of the 34 biomarkers with more than 1000 citations (Figure 2, Table 3), 79% are found in the plasma and 56% are presently used clinically (89% of which are reported in plasma). Of the 28 markers with between 500 and 1000 citations (Figure 3), 57% are plasma proteins but only 7% are used clinically. Both of the markers used clinically are plasma proteins. Some proteins with high citation frequency (eg, albumin) are somewhat surprising to see in the context of cancer biomarkers; these have been retained nevertheless because they appear to have reasonable relevance (low serum albumin levels are prognostic of poor survival (Lis et al. 2003) as noted in the table contents).

Figure 1. Biomarker Citation Frequency. Citation frequency for each protein was determined using the PubMed query ["protein name" AND human AND cancer AND diagnostic]. Proteins were then histogrammed in bins of 10, 100 and 1000 citations (for frequencies of n<100, 100<n<1000, and n>1000, respectively) and each bin's count normalized through division by bin size (eg, the count of proteins falling in the 11-20 citations bin was divided by 10).

… (Ugurel et al. 2001), pituitary (Komorowski et al. 2000) and colorectal carcinomas (Davies et al. 2000). The assay is used primarily in the diagnosis and monitoring of patients with tumours of neuroendocrine origin. Increased levels in small cell lung cancer patients are associated with shorter survival (Pujol et al. 2003 …
The progesterone receptor is a steroid receptor which stimulates hormone-specific transcription of specific genes. There is a loss of expression in prostate cancer tissue (Ji et al. 2005 …
… and triiodothyronine its level in plasma is used in the management of thyroid cancer (Whitley and Ain 2004).
Albumin
V-erb-b2, Her2/neu √ √ √ 3 yes P04626 1.1E+04 (Wu 2002) An oncogene product whose tissue expression and levels of the shed protein in serum have been shown to correlate with tumor stage in a range of adenocarcinomas (Tsigris et al. 2002).
Antigen identified by …
… CLL/lymphoma 2 An inhibitor of apoptosis, Bcl-2 maintains homeostasis in the immune system. The differing effects of Bcl-2 expression on prognosis may be due to which cells are expressing the Bcl-2, immune cells or tumors. High expression in ovarian cancer (Herod et al. 1996) and non small lung cancer (Shibata et al. 2004) are associated with better prognosis whereas well differentiated tumors more likely to be Bcl-2 positive (Soda et al. 1999 …
… levels of sE-cadherin are found in sera of patients with bladder cancer and correlate with known prognostic factors (Griffiths et al. 1996).
Caspase 3 is involved in not only apoptosis execution but also proliferation. It has been shown to be downregulated in gastric lymphoma tissue but negatively associated with lymph node metastases in gastric carcinoma (Sun et al. 2004 …
… et al. 1999) and migration of lymphocytes and macrophages may also enhance local growth and metastatic spread of tumor cells. Present in serum of normal individuals it is elevated in the serum from gastric and colon cancer patients (Guo et al. 1994), Hodgkin's lymphoma patients (Lockhart et al. 1999), and acute leukemia patients (Yokota et al. 1999 …
… proliferation, cell cycle checkpoints, and apoptosis. More than one half of all lung cancers contain a mutation of the p53 tumor suppressor gene (Johnson and Kelley 1993 …
Cyclins are in all proliferating cell types and collectively control the progression of cells through the cell cycle. Genetic alterations affecting p16(INK4a) and cyclin D1, proteins that govern phosphorylation of the retinoblastoma protein (RB) and control exit from the G1 phase of the cell cycle, are so frequent in human cancers that inactivation of this pathway may well be necessary for tumor development (Sherr 1996).
… kinase inhibitor 1, p21 P21 is a cyclin-dependent kinase inhibitor that blocks cell cycle progression. It is suppressed in malignant nasopharyngeal epithelial cells (Fung et al. 2000), but overexpressed in pancreatic ductal adenocarcinoma (Hermanova et al. 2004 …
… is released into the CSF when neural tissue is injured. Neoplasms derived from neural or neuroendocrine tissue may release NSE into the blood. Elevated levels are found in seminomas (Fossa et al. 1992), advanced non-small cell lung cancer (Barlesi et al. 2004), solid malignant tumors and malignant hematologic disorders (Burghuber et al. 1990).
Insulin √ √ 2 yes P01308 Serum insulin levels were clearly higher in patients with breast cancer than in patients with benign breast disease and healthy controls (Han et al. 2005 …
… (Sacco et al. 2000), bladder (Mizutani et al. 1998), and colon cancer patients (Kushlinskii et al. 2001 …
… (Kwak et al. 2004).
High levels in small cell lung cancer are associated with decreased survival (Bharti et al. 2004 …
Beta-catenin is necessary for the establishment and maintenance of epithelial layers. Accumulated cytoplasmic beta-catenin has been seen in esophageal squamous cell carcinoma …
… (Ehlenz et al. 1997).
alpha-2-HS-glycoprotein √ 1 yes P02765 6.1E+08 (Dickson et al. 1983) Promotes endocytosis, possesses opsonic properties and plays a role in bone metabolism; it is decreased in leukemia patients (Kwak et al. 2004 …
… considerably reduced in early melanoma-derived cell lines, and barely detectable in advanced/metastatic cell lines (Shevde et al. 2002).
CA 27.29 √ 1 NF x A monoclonal antibody identified cancer antigen most frequently used to follow response to therapy in patients with metastatic breast cancer (Perkins et al. 2003).
CA 72-4 √ 1 NF x A monoclonal antibody identified cancer antigen useful in the diagnosis of breast (Skates et al. 2004) and pancreatic cancer (Jiang et al. 2004 …
… et al. 2003) it is raised in hepatocellular cancer (Cohen 1988), inflammation and infection (Hetet et al. 2003 …
… (Ugurel et al. 2001) and myeloma (Sezer et al. 2001 …
A monoclonal antibody to epithelium-specific keratin 18 stained the majority of inner cells in benign breast lesions but comparatively fewer such cells in carcinoma in situ and invasive carcinoma (Rudland et al. 1993 …
… (Smart et al. 1990 …
… (Rizzatti et al. 2002) and in the serum of gastrointestinal stromal tumor patients (Bono et al. 2004 …
… (Medl et al. 1995).
… hormone-related protein P12272 A critical regulator of cellular and organ growth, development, migration, differentiation and survival and of epithelial calcium ion transport; parathyroid hormone-related protein is found in the serum of bone metastases (Iguchi et al. 2004), lung cancer (Nishigaki et al. 1999) patients and a multiple myeloma patient (Kitazawa et al. 2002).
Pcaf, P300/CBP-associated factor Pcaf plays a direct role in transcriptional regulation. The genes for p300, CBP, MOZ and MORF are rearranged in recurrent leukemia-associated chromosomal abnormalities (Yang 2004 …
… et al. 1995) and temporally restricted tissue distribution, it is elevated in cancer patients especially patients with high C-reactive protein levels (Schenk et al. 1995 …
… affinity for sulfated polysaccharides and the kringle 4 of plasminogen, it is an independent prognostic factor in ovarian cancer (Begum et al. 2004 …
… that can provide prognostic information on progression-free survival in leukemia patients (Hallek et al. 1996).
… (Ryschich et al. 2004) and a sensitive serum measurement of erythropoiesis and iron deficiency (Shih et al. 1990).
Tumor necrosis factor …
… et al. 1997) regulates blood coagulation, it is differentially expressed in ovarian cancer (Mor et al. 2005).
X box binding protein-1 √ 1 NF P17861 A transcription factor essential for hepatocyte growth, the differentiation of plasma cells, immunoglobulin secretion, and the unfolded protein response. It is increased in identical twins with multiple myeloma (Munshi et al. 2004). hXBP-1 mRNA expression was increased in primary breast cancers but hardly detectable in non-cancerous breast tissue (Fujimoto et al. 2003).

Proteins with a large number or percentage of citations in 2004
In an effort to include more recently discovered biomarkers we also looked at the proteins that had greater than 100 … Table 3), 32% are detected in plasma and none are presently being used clinically.

Time evolution of biomarker citations
We tracked the number of citations per year for selected cancer biomarkers over the last 35 years (Figure 5). The number of times a protein was cited in a given year ("protein name" AND cancer AND human AND diagnostic) was divided by the total number of cancer citations for that year (cancer AND human AND diagnostic) to give a rough index of the prominence of the biomarker in cancer research. Although CEA was frequently cited in the 1970s and 1980s, interest in it has dropped dramatically. The most cited marker in this group, PSA, has well-documented limitations as a diagnostic, yet it continues to be cited either as the only option or as the biomarker upon which to improve. Interest in most of these biomarkers evolves in a fairly similar way: each appears to take a few years to be recognized, followed by gradually increasing interest over the following 15 to 20 years. Of these markers the FDA has approved only three as diagnostic cancer antigens: alpha-fetoprotein, CEA, and PSA (approved May 31, 1988, October 15, 1980 and February 25, 1986 respectively; Figure 5). … (Table 4). None of these markers, used singly, has over 90% sensitivity and specificity. Although these numbers are for specific assays, they are representative of the general lack of specificity and sensitivity of the individual cancer markers currently available.
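The per-year prominence index used for Figure 5 (a protein's citations in a year divided by all cancer citations in that year) can be written as a short function. This is a minimal sketch; the function name is hypothetical and the yearly counts in the example are invented for illustration.

```python
from typing import Dict


def prominence_index(
    protein_hits: Dict[int, int],
    total_hits: Dict[int, int],
) -> Dict[int, float]:
    """Share of each year's cancer citations that mention the protein."""
    index: Dict[int, float] = {}
    for year in sorted(total_hits):
        total = total_hits[year]
        if total > 0:  # skip years with no cancer citations to avoid dividing by zero
            index[year] = protein_hits.get(year, 0) / total
    return index


# Made-up counts: raw citations rise from 50 to 80, but the index stays flat.
example = prominence_index({1990: 50, 1991: 80}, {1990: 1000, 1991: 1600, 1992: 2000})
```

Normalizing by the yearly total separates genuine changes in interest from the overall growth of the cancer literature.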

Concentration range of cancer plasma biomarkers
We attempted to collect normal plasma concentrations for candidate cancer biomarkers reported in the literature. The resulting 211 values were histogrammed (Figure 6) for comparison with the distributions of concentrations of either unselected plasma proteins from PPI's plasma protein database, or a set of candidate cardiovascular biomarkers (Anderson 2005a). The cancer candidates cover a >10-log concentration range, with proteins such as immune modulating interleukins (1α and β, 2, 5, 6, 9, 10, IFN-γ and GM-CSF) being present in normal plasma or serum in the pg/mL range while classical plasma proteins (albumin, transferrin, fibrinogen, and α-2-macroglobulin) are present at mg/mL levels. When the cancer candidate distribution is compared to the concentrations for all plasma proteins (unpublished results) and plasma markers of cardiac disease, a greater proportion of the cancer candidates appear in the lower concentration ranges than general plasma proteins or cardiac markers. Thus normal values for 185 (88%) of the markers for which we know the plasma concentration fall below 10 microgram/mL and 103 (49%) fall below 10 ng/mL. Tabulated concentrations are those found in controls, not patients; in many cases these may increase in cancer, thereby aiding in their detection.
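Comparing concentrations that span more than ten orders of magnitude requires a common unit and a logarithmic scale. The sketch below shows one way to do this, assuming values are converted to pg/mL; the helper names and the sample values are illustrative only, while the two thresholds (10 microgram/mL and 10 ng/mL) and the 185/211 and 103/211 counts come from the text.

```python
import math
from typing import Iterable, List

# Conversion factors to a common unit (pg/mL).
UNIT_TO_PG_PER_ML = {"pg/mL": 1.0, "ng/mL": 1e3, "ug/mL": 1e6, "mg/mL": 1e9}


def to_pg_per_ml(value: float, unit: str) -> float:
    """Convert a concentration to pg/mL."""
    return value * UNIT_TO_PG_PER_ML[unit]


def log10_decade(conc_pg_per_ml: float) -> int:
    """Decade (floor of log10) used to place a value on a log-scale histogram."""
    return math.floor(math.log10(conc_pg_per_ml))


def fraction_below(concs_pg_per_ml: Iterable[float], threshold_pg_per_ml: float) -> float:
    """Fraction of values strictly below a threshold."""
    values: List[float] = list(concs_pg_per_ml)
    return sum(c < threshold_pg_per_ml for c in values) / len(values)


# Illustrative values only: two low-abundance proteins and two classical plasma proteins.
sample = [to_pg_per_ml(5, "pg/mL"), to_pg_per_ml(2, "ng/mL"),
          to_pg_per_ml(3, "mg/mL"), to_pg_per_ml(40, "mg/mL")]
share_below_10_ng = fraction_below(sample, to_pg_per_ml(10, "ng/mL"))
```

The reported percentages follow directly from the counts: 185 of 211 rounds to 88% and 103 of 211 rounds to 49%.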

Gene Ontology classification of cancer candidate biomarkers
We compared the distribution of GO annotations for the cancer candidates with the distribution for all annotated human proteins over a series of summary categories, with the aim of finding any large biases in the cancer group. In comparing "Biological Process" GO annotation, the cancer biomarkers show an increased representation of apoptosis, cell cycle and proliferation annotations, processes blocked or increased in tumors (Figure 7), while metabolism, catabolism and transport proteins are decreased. When the two sets are compared by "Cellular Component" GO terms (Figure 8), the extracellular category is over-represented in the cancer biomarker database in comparison with the whole human database (20% versus 6% respectively). This is true even if the proteins found experimentally in plasma are excluded (12% extracellular). The other Cellular Component categories show only small differences between the protein sets.
Comparing "Molecular Function" GO terms, only small differences are apparent between the cancer candidates and the whole annotated human proteome.
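The comparison described above amounts to computing each summary category's share of annotations in the candidate set and in the whole human set, then comparing the two. The following sketch illustrates that calculation; the function names are hypothetical, the grouping of GO IDs into summary categories is assumed to have been done beforehand, and the toy lists are not real annotation data.

```python
from collections import Counter
from typing import Dict, Iterable


def category_shares(annotations: Iterable[str]) -> Dict[str, float]:
    """Fraction of annotations falling in each summary category."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def enrichment(candidates: Dict[str, float], background: Dict[str, float]) -> Dict[str, float]:
    """Ratio of candidate share to background share; above 1 means over-represented."""
    return {
        category: candidates.get(category, 0.0) / share
        for category, share in background.items()
        if share > 0
    }


# Toy annotation lists, one category label per annotation.
cand = category_shares(["extracellular"] * 2 + ["nucleus"] * 8)
human = category_shares(["extracellular"] * 1 + ["nucleus"] * 9)
ratios = enrichment(cand, human)
```

Applied to the reported Cellular Component shares, 0.20 / 0.06 gives roughly a 3.3-fold over-representation of the extracellular category, and 0.12 / 0.06 a 2-fold one once plasma-detected proteins are excluded.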

Prioritization of candidates
Given the size of the list of candidates resulting from our assembly procedure, we attempted to select a smaller subset of higher priority candidates as a starting point for consideration of assay development and clinical validation. This subset, comprising 260 proteins (Table 3), was compiled from the most highly cited proteins, the "recent" markers, plasma proteins of known concentration (indicating existence of an assay) and any marker presently in any type of clinical use. Many of these markers fall into expected categories such as immune modulation molecules (acute phase proteins, coagulation factors, immune modulators) and mediators of classical cancer pathways (oncoproteins, angiogenic or apoptosis factors, tumor suppressors or antigens, cellular homing or proliferation molecules). Somewhat less expected, perhaps, is that 22 (8%) of these top 260 proteins are involved in hormonal action.
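The subset is a union of criteria: a candidate qualifies if it meets any one of them. A minimal sketch of such a filter is given below. The record fields and function name are hypothetical, the 500-citation cutoff is borrowed from the "extensively studied" group discussed earlier, and the threshold for a "recent" marker is a placeholder assumption, since the exact rule is not restated here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    name: str
    total_citations: int
    recent_fraction: float  # share of citations in the most recent year
    plasma_conc_known: bool
    clinical_use: bool


def is_high_priority(
    c: Candidate,
    min_total: int = 500,              # cutoff taken from the highly cited group
    min_recent_fraction: float = 0.5,  # placeholder assumption for "recent" markers
) -> bool:
    """A candidate qualifies if it satisfies any one of the four criteria."""
    return (
        c.total_citations >= min_total
        or c.recent_fraction >= min_recent_fraction
        or c.plasma_conc_known
        or c.clinical_use
    )


def prioritize(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep qualifying candidates, most cited first."""
    selected = [c for c in candidates if is_high_priority(c)]
    return sorted(selected, key=lambda c: (-c.total_citations, c.name))
```

Using a union rather than an intersection keeps newly emerging markers with few total citations alongside long-established ones.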

Existence of a specific antibody
For each of the 260 high priority candidates, we performed web searches, primarily through the Exact Antigen website (www.exactantigen.com), to determine whether an antibody with potential utility in a plasma assay is commercially available. Relevant antibodies were found for 186 (72%) of the 260 high priority candidates.

Discussion
According to the Centers for Disease Control, 1 in every 4 deaths in the United States is due to cancer. Many of these deaths could be averted by improved early cancer detection, since existing therapies, especially surgery, are much more effective in early cancer stages as compared to later stages (Etzioni et al. 2003). Billions of dollars have been spent on basic research looking for molecular differences related to cancer, work that has been at least partly motivated by the need for improved in vitro diagnostic tests to detect or monitor progression of cancer. Yet to our knowledge no centralized database of known candidate cancer biomarkers exists. Such a list could serve to confirm new results, eg, from proteomic comparisons of cancer and control sera, by placing them in a context of earlier work. Additionally it could serve as a reservoir of current and future candidates to be tested in large sample sets by candidate-based ("targeted" or "directed") proteomics methods. The latter use is important, since candidate-based methods, consisting of specific assays for defined targets, are likely to be much more sensitive than proteome profiling methods, and hence could cover a much broader universe of protein candidates and potentially detect disease states earlier.
The present catalog of 1261 human candidate cancer biomarkers is a first attempt at such a database. We did not select specific cancer types or specific detection methods, choosing instead to cast a broad net. In the resulting list, it will be apparent that the strength of evidence and likelihood of ultimate usefulness of the candidates varies widely. Even candidates that have been tested and found to have poor diagnostic specificity and sensitivity were retained, as they may nevertheless contribute to useful panels as in the work of Mor and Xiao. Looking at the list, one might question why the most abundant plasma protein (serum albumin) is included; though perhaps counter-intuitive, albumin does meet the search criteria used, and is in fact a useful negative acute phase indicator likely to be altered in cancer along with many inflammation-related proteins. Other well-known proteins not usually considered as cancer-specific are also included (eg, protein and peptide hormones overproduced by endocrine tumors or through ectopic synthesis). Overall, the list is not easily recognizable by inspection as a list of cancer markers.
Of the 1261 proteins, 22% are reported to occur in plasma. This is an appreciable fraction considering that many of the large array studies, capable of finding many markers per experiment, have looked for differential protein or DNA expression in tissues. For bona fide cell-associated cancer markers such as Her-2, there is persuasive evidence that at least a fragment of the protein molecule is released into the plasma and can be detected as a cancer biomarker (Tse et al. 2005), and other proteins documented here in the tissues of cancer patients have been demonstrated to be found in plasma in other disease indications. These cases provide some support for the hypothesis that most if not all of the 1261 proteins should be detectable at some level in plasma, the diagnostic sample of choice, given a sensitive enough assay. Whether current assay technologies will be sensitive enough to see a large fraction of the candidates in plasma is a major question at this point, and one that will require vigorous efforts to resolve.
As might be expected, there is a smooth distribution in the number of literature citations per candidate, ranging from almost 8,000 (for PSA) to zero (for candidates not mentioned as diagnostic by the publication's authors). This result suggests that our literature analysis did not identify a crisply defined set of cancer markers, but rather part of a continuum extending from a few established markers through plausible candidates into more speculative possibilities. Given the complexity of cancer, such an outcome is not surprising.
Only 5% of the 1261 candidates have been extensively studied (500 or greater total citations over the years). When examined as a function of time, the citation history of individual markers appears to show a slow evolution of interest that peaks 15 to 20 years after the initial papers. Only in the cases of CEA and PSA was discovery of a biomarker followed by a rapid increase in publications over a few years, and in the case of PSA the steady increase was seen only 10 years after the first citations appeared. Thus in order to catch recently emerged candidates, we focused on candidates with a high proportion of citations occurring in 2004 but with fewer total citations (often 10 or fewer). Of the total 1261 proteins only 41 are used in some clinical sense and even fewer have FDA-approved assays.
While the observed slow pace is easily explained by the deliberate nature of clinical research and the progressive, rather than abrupt, nature of adoption in medical practice, it presents a stark reminder of the challenge involved in making any rapid advance in cancer diagnostics.
These candidate cancer markers, taken as a group, appear to be present in plasma at lower concentrations than comparable groups of cardiac markers or unselected plasma proteins. Although systematic biases in selection of these groups could affect this result, it tends to support the contention that plasma cancer marker discovery is, and may continue to be, a challenge in terms of detection sensitivity. Present discovery proteomics platforms typically detect proteins with plasma concentrations in the mg/mL to microg/mL range. For the proteins in our list with known plasma concentrations, we estimate that 86% would be missed by most conventional proteomics platforms, while 48% would be missed by high-end proteomics platforms with extensive multi-dimensional fractionation. For the present, the only way that many of these proteins can be detected is by specific assays: ie, by targeted proteomics. Targeted proteomics thus represents a preferred path to validation and further study of the candidate markers listed here.
The distribution of our cancer biomarker candidate proteins among GO annotation categories shows remarkable similarity to the distribution for all annotated human proteins. There is some enrichment for proteins annotated as related to apoptosis, cell cycle and proliferation (in the GO biological process category), as would be expected on account of the fundamental involvement of these processes in cancer. The extracellular group (in the GO cell component category) is also somewhat over-represented, a trend favorable to detection in plasma. Nevertheless the candidates seem to represent a very wide sampling of the human proteome.
The full set of these 1,261 candidates is too large to submit for immediate verification and validation in large sample sets by any available means, and some method of prioritization is required to initiate their evaluation. As an initial approach, we have selected a subset of the candidates based on a set of criteria including number of total citations, number of recent citations, proportion of recent citations, known plasma concentration (implying existence of an assay) and clinical use in any context. This subset of 260 candidates (presented in Table 3) includes 186 candidates for which a relevant antibody is commercially available, opening the possibility of testing this group using an antibody array or other miniaturized immunoassay technology in the near future.
While the list of candidate cancer biomarkers assembled here is clearly a simplistic and therefore somewhat crude initial catalog, we believe the result will prove to be of sufficient value to justify extending the effort to provide an ongoing summary of the progress of cancer diagnostics. In particular we believe that linking a database of marker candidates to the bioinformatics architecture used in biomarker discovery will help to connect the discovery and validation phases (Anderson 2005b) necessary for progression of biomarkers to the clinic. One can envision a steady accumulation of candidates, regular revision of candidate priorities as evidence emerges from multiple sources (literature, microarrays, systems models, etc), and finally feedback in the form of specific measurements from validation studies in large sample sets. Such a collection of data would provide an up-to-date snapshot of the workings of a cancer diagnostic marker pipeline.
Finally, lists such as this prompt important, but infrequently asked, questions regarding the most productive tack for future discovery efforts. Is it reassuring to find confirmation of fresh observations through overlap with a pre-existing list? Perhaps so, and particularly if the candidates involved appear repeatedly in similar independent studies. However, the sieve used here is crude and so our list cannot really "confirm" a candidate seen in a new study; overlap just improves the odds of relevance. Further, since there are certain to be good cancer markers not on this list, failure to appear here in no way disqualifies a novel marker.
Hence our hope is to contribute a mechanism for marginally improving chances of recognizing a valid marker, and a systematic source for enriched candidates available for validation and panel assembly efforts.